HW 3 Nonlinear & Nonparametric Regression: Data Analysis Problems
Advanced Regression (STAT 353-0)
All students are required to complete this problem set!
Load packages & data
library(ggplot2)
library(car)
library(MASS)
library(plotly)
library(olsrr)
Data analysis problems
1. Exercise D17.1
The data in Ginzberg.txt (collected by Ginzberg) were analyzed by Monette (1990). The data are for a group of 82 psychiatric patients hospitalized for depression. The response variable in the data set is the patient’s score on the Beck scale, a widely used measure of depression. The explanatory variables are “simplicity” (measuring the degree to which the patient “sees the world in black and white”) and “fatalism”. (These three variables have been adjusted for other explanatory variables that can influence depression.) Use the adjusted scores for the analysis.
Using the full quadratic regression model
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \epsilon\]
regress the Beck-scale scores on simplicity and fatalism.
- Are the quadratic and product terms needed here?
- Graph the data and the fitted regression surface in three dimensions. Do you see any problems with the data?
- What do standard regression diagnostics for influential observations show?
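The full quadratic model above can be fit in R with `lm()`, and the need for the quadratic and product terms can be checked with a joint F-test against the linear-terms-only model. This is a minimal sketch; since the Ginzberg data file is not loaded here, it uses simulated stand-in variables (`depression`, `simplicity`, `fatalism`) with assumed names so the code runs on its own.

```r
# Sketch of the full quadratic fit; simulated stand-ins for the
# (assumed) adjusted Ginzberg variables so the example is self-contained.
set.seed(1)
n <- 82
simplicity <- rnorm(n)
fatalism <- rnorm(n)
depression <- 0.5 + 0.3 * simplicity + 0.4 * fatalism +
  0.1 * simplicity^2 + rnorm(n, sd = 0.3)
d <- data.frame(depression, simplicity, fatalism)

# Full quadratic model: linear, squared, and product terms
quad_fit <- lm(depression ~ simplicity + fatalism +
                 I(simplicity^2) + I(fatalism^2) +
                 simplicity:fatalism, data = d)

# Joint F-test: are the quadratic and product terms needed?
lin_fit <- lm(depression ~ simplicity + fatalism, data = d)
anova(lin_fit, quad_fit)
```

With the real data, the same `anova()` comparison answers the first bullet directly, and `influencePlot(quad_fit)` from the already-loaded car package gives the influence diagnostics asked for in the third bullet.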
2. Exercise D18.2
For this analysis, use the States.txt data, which includes average SAT scores for each state as the outcome.
- Put together a model with SAT math (satMath) as the outcome and region, population, percentTaking, and teacherPay as the explanatory variables, each included as linear terms. Interpret the findings.
- Now, instead approach building this model using the nonparametric-regression methods discussed in Chapter 18 of our main course textbook, Fox. Fit a general nonparametric regression model and an additive-regression model, comparing the results to each other and to the linear least-squares fit to the data in part (a). If you have problems with categorical variables for the nonparametric models, feel free to remove them. Be sure to explain the models.
- Can you handle the nonlinearity by a transformation or by another parametric regression model, such as a polynomial regression? Investigate and explain. What are the tradeoffs between these nonparametric and parametric approaches?
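One way to set up the additive-regression comparison is with `gam()` from the mgcv package (which ships with R), wrapping each numeric predictor in a smooth term `s()` and comparing the fit to the linear least-squares model. This is a sketch under assumptions: the States column names (`satMath`, `population`, `percentTaking`, `teacherPay`) are taken from the problem statement, and simulated stand-in data are used here so the example runs on its own.

```r
# Sketch: linear fit vs. additive model with smooth terms.
# Simulated stand-ins for the States variables keep this self-contained.
library(mgcv)  # recommended package shipped with R; gam() fits additive models
set.seed(2)
n <- 50
states <- data.frame(
  population    = runif(n, 0.5, 35),
  percentTaking = runif(n, 4, 80),
  teacherPay    = runif(n, 25, 50)
)
states$satMath <- 600 - 1.2 * states$percentTaking +
  0.5 * states$teacherPay + rnorm(n, sd = 15)

# Linear least-squares fit (as in part (a), minus the factor region)
lin_fit <- lm(satMath ~ population + percentTaking + teacherPay,
              data = states)

# Additive model: each predictor enters through a smooth term
add_fit <- gam(satMath ~ s(population) + s(percentTaking) + s(teacherPay),
               data = states)

AIC(lin_fit, add_fit)  # compare the two fits
```

Plotting the additive fit with `plot(add_fit, pages = 1)` shows each estimated partial-regression function, which is where any nonlinearity (and a candidate transformation for the parametric follow-up) would show up.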
3. Exercise D18.3
Return to the Chile.txt dataset used in HW 2. Reanalyze the data employing generalized nonparametric regression (including generalized additive) models. As in HW2, you can remove abstained and undecided votes, and focus only on Yes and No votes.
- What, if anything, do you learn about the data from the nonparametric regression?
- If the results appear to be substantially nonlinear, can you deal with the nonlinearity in a suitably respecified generalized linear model (e.g., by transforming one or more explanatory variables)? If they do not appear nonlinear, still try a transformation to see if anything changes.
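For a binary Yes/No outcome, the generalized additive model is a logistic GAM: smooth terms on the predictors, `family = binomial`. The sketch below uses `gam()` from mgcv with simulated stand-in data (variable names `age` and `statusquo` are assumptions based on the Chile data's typical predictors), so it runs on its own; with the real data you would subset to Yes/No voters first.

```r
# Sketch: generalized additive (logistic) regression for a binary vote.
# Simulated stand-ins for the Chile variables keep this self-contained.
library(mgcv)
set.seed(3)
n <- 500
chile <- data.frame(age = runif(n, 18, 80), statusquo = rnorm(n))
p <- plogis(-0.5 + 1.8 * chile$statusquo)
chile$vote <- rbinom(n, 1, p)  # 1 = Yes, 0 = No

# Smooth terms on both predictors, logit link via family = binomial
gam_fit <- gam(vote ~ s(age) + s(statusquo),
               family = binomial, data = chile)
summary(gam_fit)
plot(gam_fit, pages = 1)  # estimated smooths on the logit scale
```

If the plotted smooths look close to straight lines, a plain logistic GLM suffices; visible curvature suggests trying a transformation or polynomial term in the corresponding predictor of a respecified GLM.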
4. Exercise E18.7
For this analysis, use the Duncan.txt data. Here we are interested in the outcome prestige and the explanatory variable income.
- Fit the local-linear regression of prestige on income with span \(s = 0.6\) (see Figure 18.7 in the book). This fit has 5.006 equivalent degrees of freedom, very close to the degrees of freedom for a fourth-order polynomial.
- Fit a fourth-order polynomial to the data and compare the resulting regression curve with the local-linear regression. ::: {.callout-tip icon="false"}
## Solution
# Assumes Duncan.txt is in the working directory with columns prestige and income
d_data <- read.table("Duncan.txt", header = TRUE)
# Local-linear regression with span = 0.6 (degree = 1 gives a local-linear fit)
model_loess <- loess(prestige ~ income, data = d_data, span = 0.6, degree = 1)
# Fourth-order polynomial fit
poly_model <- lm(prestige ~ poly(income, 4), data = d_data)
income_grid <- seq(min(d_data$income),
max(d_data$income),
length.out = 100)
# Get predictions for both models
local_pred <- predict(model_loess,
newdata = data.frame(income = income_grid))
poly_pred <- predict(poly_model,
newdata = data.frame(income = income_grid))
plot(d_data$income, d_data$prestige,
xlab = "Income",
ylab = "Prestige",
main = "Comparison of Local-Linear and Polynomial Regression",
pch = 16, col = "gray40")
# Add both fitted lines
lines(income_grid, local_pred,
col = "blue",
lwd = 2,
lty = 1)
lines(income_grid, poly_pred,
col = "red",
lwd = 2,
lty = 2)
legend("topleft",
legend = c("Local-Linear (span=0.6)",
"4th Order Polynomial"),
col = c("blue", "red"),
lwd = 2,
lty = c(1, 2),
bg = "white")
cat("\nSummary of fourth-order polynomial fit:\n")
Summary of fourth-order polynomial fit:
print(summary(poly_model))
Call:
lm(formula = prestige ~ poly(income, 4), data = d_data)
Residuals:
Min 1Q Median 3Q Max
-44.590 -9.181 2.714 10.295 60.323
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.689 2.615 18.233 < 2e-16 ***
poly(income, 4)1 175.114 17.545 9.981 2.04e-12 ***
poly(income, 4)2 -14.998 17.545 -0.855 0.398
poly(income, 4)3 -10.084 17.545 -0.575 0.569
poly(income, 4)4 -19.562 17.545 -1.115 0.272
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.55 on 40 degrees of freedom
Multiple R-squared: 0.7181, Adjusted R-squared: 0.69
F-statistic: 25.48 on 4 and 40 DF, p-value: 1.538e-10
# R-squared for both fits for comparison
# For local-linear fit
local_fitted <- predict(model_loess, newdata = data.frame(income = d_data$income))
local_r2 <- 1 - sum((d_data$prestige - local_fitted)^2) /
sum((d_data$prestige - mean(d_data$prestige))^2)
cat("\nR-squared values:\n")
R-squared values:
cat("Local-linear R²:", round(local_r2, 4), "\n")
Local-linear R²: 0.7227
cat("Polynomial R²:", round(summary(poly_model)$r.squared, 4), "\n")
Polynomial R²: 0.7181
The plot shows that both models capture similar overall patterns in the data: prestige generally increases with income but with some nonlinear behavior. Both fits rise steeply at lower income levels, level off at higher incomes, and dip slightly at the very highest income levels.
However, the local-linear fit appears slightly smoother, while the polynomial fit shows slightly more curvature, especially in the middle range. The difference in R² is very small (about 0.0046), suggesting comparable overall fit quality; the local-linear model performs slightly better by this measure.
The polynomial regression summary shows a highly significant first-order term (p ≈ 2e-12), while the higher-order terms (second through fourth) are not individually significant (p-values > 0.05). :::